ABSTRACT

We describe the design and implementation of a high per- 
formance cloud that we have used to archive, analyze and 
mine large distributed data sets. By a cloud, we mean an in- 
frastructure that provides resources and/or services over the 
Internet. A storage cloud provides storage services, while 
a compute cloud provides compute services. We describe 
the design of the Sector storage cloud and how it provides 
the storage services required by the Sphere compute cloud. 
We also describe the programming paradigm supported by 
the Sphere compute cloud. Sector and Sphere are designed 
for analyzing large data sets using computer clusters con- 
nected with wide area high performance networks (for ex- 
ample, 10+ Gb/s). We describe a distributed data mining 
application that we have developed using Sector and Sphere. 
Finally, we describe some experimental studies comparing 
Sector/Sphere to Hadoop. 

Categories and Subject Descriptors: H.2.8 [Database 
Management]: Data mining, C.2.4 [Computer-Communications 
Networks]: Distributed applications, D.4.3 [Operating Systems]:
Distributed file systems, D.4.1 [Process Management]:
Multiprocessing/multiprogramming/multitasking 

General Terms: design, experimentation, measurement, 
performance 

Keywords: distributed data mining, cloud computing, high 
performance data mining 

1. INTRODUCTION 

Historically, high performance data mining systems have 
been designed to take advantage of powerful, but shared 
pools of processors. Generally, data is scattered to the pro- 
cessors, the computation is performed using a message pass- 
ing or grid services library, the results are gathered, and the 
process is repeated by moving new data to the processors. 

This paper describes a distributed high performance data 
mining system that we have developed called Sector/Sphere 
that is based on an entirely different paradigm. Sector is 



designed to provide long term persistent storage to large 
datasets that are managed as distributed indexed files. Dif- 
ferent segments of the file are scattered throughout the dis- 
tributed storage managed by Sector. Sector generally repli- 
cates the data to ensure its longevity, to decrease the latency 
when retrieving it, and to provide opportunities for paral- 
lelism. Sector is designed to take advantage of wide area 
high performance networks when available. 

Sphere is designed to execute user defined functions in par- 
allel using a stream processing pattern for data managed 
by Sector. We mean by this that the same user defined 
function is applied to every data record in a data set man- 
aged by Sector. This is done to each segment of the data 
set independently (assuming that sufficient processors are 
available), providing a natural parallelism. The design of 
Sector/Sphere results in data frequently being processed in 
place without moving it. 

To summarize, Sector manages data using distributed, in- 
dexed files; Sphere processes data with user-defined func- 
tions that operate in a uniform manner on streams of data 
managed by Sector; Sector/Sphere scale to wide area high 
performance networks using specialized network protocols 
designed for this purpose. 

In this paper, we describe the design of Sector/Sphere. We 
also describe a data mining application developed using Sec- 
tor/Sphere that searches for emergent behavior in distributed 
network data. We also describe various experimental studies 
that we have done using Sector/Sphere. Finally, we describe 
several experimental studies comparing Sector/Sphere to Hadoop
using the Terasort benchmark [3], as well as a companion
benchmark we have developed called Terasplit that computes
a split for a regression tree.

This paper is organized as follows: Section 2 describes back- 
ground and related work. Section 3 describes the design of 
Sphere. Section 4 describes the design of Sector. Section 
5 describes the design of the networking and routing layer. 
Section 6 contains some experimental studies. Section 7 de- 
scribes a Sector/Sphere application that we have developed. 
Section 8 is the summary and conclusion. 



This paper is based in part on [15]. In particular, some 
of the introductory and background material is the same. 
This paper describes a later version of the Sector/Sphere 
system, describes the Sector/Sphere system in more detail, 



and contains different experimental results. 

2. BACKGROUND AND RELATED WORK 

By a cloud, we mean an infrastructure that provides re- 
sources and/or services over the Internet. A storage cloud 
provides storage services (block or file based services); a data 
cloud provides data management services (record-based, column- 
based or object-based services); and a compute cloud pro- 
vides computational services. Often these are layered (com- 
pute services over data services over storage service) to cre- 
ate a stack of cloud services that serves as a computing plat- 
form for developing cloud-based applications. 

Examples include Google's Google File System (GFS), BigTable
and MapReduce infrastructure [5], [8]; Amazon's S3 storage
cloud, SimpleDB data cloud, and EC2 compute cloud [17];
and the open source Hadoop system [5], [21].

In this section, we describe some related work in high per- 
formance and distributed data mining. For a recent survey 
of high performance and distributed data mining systems, 
see [It?] . 

By and large, data mining systems that have been devel- 
oped to date for clusters, distributed clusters and grids have 
assumed that the processors are the scarce resource, and 
hence shared. When processors become available, the data 
is moved to the processors, the computation is started, and 
results are computed and returned [7]. In practice with this 
approach, for many computations, a good portion of the 
time is spent transporting the data. 

In contrast, the approach taken here by Sector/Sphere is to 
store the data persistently and to process the data in place 
when possible. In this model, the data waits for the task 
or query. The storage clouds provided by Amazon's S3 [1],
the Google File System [8], and the open source Hadoop 
Distributed File System (HDFS) [3] support this model. 

MapReduce and Hadoop and their underlying file systems 
GFS and HDFS are specifically designed for racks of com- 
puters in data centers. Both systems use information about 
clusters and racks to position file blocks and file replicas. 
This approach does not work well with loosely coupled dis- 
tributed environments, such as those that Sector targets. 



To date, work on storage clouds [8], [3], [1] has assumed relatively
small bandwidth between the distributed clusters containing
the data. In contrast, the Sector storage cloud described
in Section 4 is designed for wide area, high performance
10 Gb/s networks and employs specialized protocols,
such as UDT [13], to utilize the available bandwidth on these
networks.

Sector is also designed for loosely coupled distributed sys- 
tems that are managed with a peer-to-peer architecture, 
while storage clouds such as GFS and HDFS are designed 
for more tightly coupled systems that are managed with a 
centralized master node. 
Figure 1: A data stack for a cloud consists of three
layered services as indicated. 



figured Sector processes a 1 TB file using 64 chunks, each 
of which is a file, while HDFS processes the same data using
8,192 chunks, each of which is a block. (The default block 
size for HDFS is 64 MB. We increased this to 128 MB for the 
experiments described below, which improved the Hadoop 
experimental results.) 

The most common way to code data mining algorithms on 
clusters and grids is to use message passing, such as pro- 
vided by the MPI library [To] , or to use grid libraries and 
services, such as globus-url-copy to scatter and gather data 
and programs and globus-job-run to run programs [7]. 

The most common way to compute over GFS and HDFS 
storage clouds is to use MapReduce [5]. With MapReduce:
i) relevant data is extracted in parallel over multiple nodes 
using a common "map" operation; ii) the data is then trans- 
ported to other nodes as required (this is referred to as a 
shuffle); and, iii) the data is then processed over multiple 
nodes using a common "reduce" operation to produce a re- 
sult set. In contrast, the Sphere compute cloud described in 
Section 3 allows arbitrary user defined operations to replace
both the map and reduce operations. In addition, Sphere
uses the same specialized network transport protocols
that Sector uses, so that any data transfer required by
Sphere's user defined functions can be carried out efficiently
over wide area high performance networks.

3. DESIGN OF SPHERE 
3.1 Overview 

The Sphere Compute Cloud is designed to be used with the 
Sector Storage Cloud. Sphere is designed so that certain 
specialized, but commonly occurring, distributed computing 
operations can be done very simply. Specifically, if a user 
defines a function p on a distributed data set a managed by 
Sector, then invoking the command 

sphere.run(a, p);



applies the user defined function p to each data record in 
the dataset a. In other words, if the dataset a contains 
100,000,000 records a[i], then the Sphere command above 
replaces all the code required to read and write the array 
a[i] from disk, as well as the loop: 



for (int i = 0; i < 100000000; ++i)
    p(a[i]);



In addition, Sector assumes that the data is divided into 
files, while GFS and HDFS divide the data into blocks that 
are scattered across processors. For example, as usually con- 



The Sphere programming model is a simple example of what 
is commonly called a stream programming model. Although 



this model has been used for some time, it has recently re- 
ceived renewed attention due to its use by the general pur- 
pose GPU (Graphics Processing Units) community (GPGPU 
community) [18] and by the popularization of the MapRe- 
duce [5] special case, which restricts attention to data of the 
form [key, value] and to two user defined functions (Map 
and Reduce). 

Large data sets processed by Sphere are assumed to be bro- 
ken up into several files. For example, the Sloan Digital Sky 
Survey dataset [9] is divided up into 64 separate files, each 
about 15.6 GB in size. The files are named sdss1.dat, ...,
sdss64.dat.

Assume that the user has written a function called findBrownDwarf
that, given a record in the SDSS dataset, extracts
candidate brown dwarfs. Then to find brown dwarfs
in the Sloan dataset, one uses the following Sphere code:

Stream sdss;

sdss.init(...); // init with 64 sdss files
Process* myproc = Sector::createJob();
myproc->run(sdss, "findBrownDwarf");
myproc->read(result);

With this code, Sphere uses Sector to access the required 
SDSS files, uses an index to extract the relevant records, 
and for each record invokes the user defined function find- 
BrownDwarf. Parallelism is achieved in two ways. First, the 
individual files can be processed in parallel. Second, Sector 
is typically configured to create replicas of files for archival 
purposes. These replicas can also be processed in parallel. 

An important advantage provided by a system such as Sphere 
is that often data can be processed in place, without moving 
it. In contrast, a grid system generally transfers the data to 
the processes prior to processing [7]. 

3.2 Sphere Computing Model 

The computing model used by Sphere is based upon the fol- 
lowing concepts. A Sphere dataset consists of one or more 
physical files. Computation in Sphere is done by user defined
functions (Sphere operators) that take a Sphere stream as input
and produce a Sphere stream as output. Sphere streams
are split into one or more data segments that are processed
by Sphere servers, which are called Sphere Processing Elements
or SPEs. A Sphere data segment can be a data record,
a collection of data records, or a file. See Figure 2.

When a Sphere function processes a stream, the resulting 
stream can be returned to the Sector node where it orig- 
inated, written to a local node, or "shuffled" to a list of 
nodes, depending upon how the output stream is defined. 

The SPE is the major Sphere service and it is started by a 
Sphere server in response to a request from a Sphere client. 
Each SPE is based on a user-defined function (Sphere op- 
erator). The Sphere operator is implemented as a dynamic 
library and is stored on the server's local disk, which is man- 
aged by the Sector server. For security reasons, uploading 
such library files to a Sector server is limited. A library 
file resides on a Sector server only if the Sphere client pro- 



gram has write access to the particular Sector server or the 
server's owner has voluntarily downloaded the file. Sector's 
replica service is disabled for Sphere operators. 

Once the Sphere server accepts the client's request, it starts 
an SPE and binds it to the local Sphere operator. The SPE 
runs a loop that consists of the following four steps:

1. The SPE accepts a new data segment from the client, 
which contains the file name, offset, number of rows to 
be processed, and additional parameters. 

2. The SPE reads the data segment and its record in- 
dex from local disk or from a remote disk managed by 
Sector. 

3. For each data segment (single data record, group of 
data records, or entire data file), the Sphere opera- 
tor processes the data segment and writes the result 
to a temporary buffer. In addition, the SPE period- 
ically sends acknowledgments to the client about the 
progress of the processing. 

4. When the data segment is completely processed, the 
SPE sends an acknowledgment to the client and writes 
the results to the appropriate destinations, as specified 
in the output stream. If there are no more data seg- 
ments to be processed, the client closes the connection 
to the SPE, and the SPE is released. 
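
The four steps above can be sketched in standard C++ as follows. This is only an illustration under simplifying assumptions: the names `SegmentRequest`, `RecordFile`, `Operator`, `processSegment` and `runSpe` are hypothetical and are not part of the Sphere API, a file is modeled as an in-memory vector of records, and acknowledgments and network transfer are omitted.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one data segment, following step 1:
// file name, offset (first row) and number of rows to process.
struct SegmentRequest {
    std::string fileName;
    std::size_t offset;
    std::size_t rows;
};

// Hypothetical in-memory stand-in for a Sector file: one string per record.
using RecordFile = std::vector<std::string>;
// Hypothetical stand-in for a Sphere operator applied to a single record.
using Operator = std::function<std::string(const std::string&)>;

// Read the records of one segment (step 2), apply the operator to each
// record into a temporary buffer (step 3), and return the buffer so the
// caller can deliver it to the output destination (step 4).
std::vector<std::string> processSegment(const RecordFile& file,
                                        const SegmentRequest& req,
                                        const Operator& op) {
    std::vector<std::string> buffer;
    for (std::size_t i = req.offset;
         i < req.offset + req.rows && i < file.size(); ++i) {
        buffer.push_back(op(file[i]));
    }
    return buffer;
}

// The SPE loop: handle segments one after another until none are left.
std::vector<std::string> runSpe(const RecordFile& file,
                                const std::vector<SegmentRequest>& requests,
                                const Operator& op) {
    std::vector<std::string> output;
    for (const SegmentRequest& req : requests) {
        std::vector<std::string> part = processSegment(file, req, op);
        output.insert(output.end(), part.begin(), part.end());
    }
    return output;
}
```

In this sketch the results of all segments are simply concatenated; in Sphere the destination of each result is determined by the definition of the output stream.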

Sphere assigns SPEs to streams as follows: 

1. The stream is first divided into data segments. This is
done roughly as follows. The total data size S and the
total number of records R are computed. Say the number
of SPEs available for the job is N. Roughly speaking,
each SPE should be assigned the number of records
whose total size is S/N. The user specifies a minimum
and maximum data size, Smin and Smax, that should
be assigned to each processor. If S/N is between these
user defined limits, the associated number of records is
assigned to each SPE. Otherwise the nearest boundary,
Smin or Smax, is used instead to compute the required
number of records to assign to each SPE.

2. Once the stream is segmented into data segments of 
the appropriate size, each data segment is assigned to 
an SPE on the same machine as the data whenever possible.

3. Data segments from the same file are not processed at 
the same time, unless not doing so would result in an 
idle SPE. 
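
The segment sizing rule in step 1 can be illustrated with a short C++ function. This is a minimal sketch, not the Sphere implementation: the function name `recordsPerSegment` and its parameters are hypothetical, and it assumes that all records have roughly the same size S/R, which the text does not state.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of records to assign to each SPE.
//   totalSize     S, total data size in bytes
//   totalRecords  R, total number of records
//   numSpes       N, number of SPEs available for the job
//   minSize       Smin, user-specified lower limit in bytes
//   maxSize       Smax, user-specified upper limit in bytes
std::uint64_t recordsPerSegment(std::uint64_t totalSize,
                                std::uint64_t totalRecords,
                                std::uint64_t numSpes,
                                std::uint64_t minSize,
                                std::uint64_t maxSize) {
    if (totalSize == 0 || totalRecords == 0 || numSpes == 0) return 0;
    // Target segment size is S/N, replaced by the nearest limit when it
    // falls outside [Smin, Smax].
    std::uint64_t target =
        std::min(std::max(totalSize / numSpes, minSize), maxSize);
    // Assumption: records are of roughly uniform size S/R.
    double avgRecordSize =
        static_cast<double>(totalSize) / static_cast<double>(totalRecords);
    std::uint64_t n = static_cast<std::uint64_t>(
        static_cast<double>(target) / avgRecordSize);
    // Always assign at least one record per segment.
    return std::max<std::uint64_t>(n, 1);
}
```

For example, with S = 1000 bytes, R = 100 records and N = 4, the target S/N is 250 bytes; with Smin = 100 and Smax = 200 the upper limit applies and each segment holds 20 records.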

4. DESIGN OF SECTOR 

Sector is the underlying storage cloud that provides persis- 
tent storage for the data required by Sphere and manages 
the data for Sphere operations. Since some portions of Sec- 
tor have been described previously [14], we present just a 
brief summary here. Sector is not a file system per se, but 
rather provides services that rely in part on the local native 
file systems. 

The core requirements for Sector are: 
Figure 2: This figure illustrates how Sphere operators process Sphere streams over distributed Sphere Pro-
cessing Elements (SPE). 




Figure 3: With Sector, only users in a community 
who have been added to the Sector access control 
can write data into Sector. On the other hand, any 
member of the community or of the public can read 
data, unless additional restrictions are imposed. 



1. Sector provides long term archival storage and access 
for large distributed datasets. 

2. Sector is designed to utilize the bandwidth available 
on wide area high performance networks. 

3. Sector supports a variety of different routing and net- 
work protocols. 

4. Sector is designed to support a community of users, 
not all of whom may have write access to the Sector 
infrastructure. 



Sector uses replication in order to safely archive data. It 
monitors the number of replicas, and, when necessary, cre- 
ates additional replicas at a random location. The number 
of replicas of each file is checked once per day. The choice of 
random location leads to uniform distribution of data over 
the whole system. 
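
The replica check described above can be sketched as follows. This is an illustration under stated assumptions, not Sector code: nodes are identified by integers, the function name `chooseNewReplicaNodes` is hypothetical, and the sketch additionally assumes that a new replica is never placed on a node that already holds one.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Hypothetical sketch: pick the nodes that should receive additional
// replicas so that a file reaches `desired` copies.
std::vector<int> chooseNewReplicaNodes(const std::set<int>& holders,
                                       const std::vector<int>& allNodes,
                                       std::size_t desired,
                                       std::mt19937& rng) {
    // Enough replicas already exist: nothing to do.
    if (holders.size() >= desired) return {};
    std::size_t missing = desired - holders.size();

    // Candidates are the nodes that do not yet hold a replica.
    std::vector<int> candidates;
    for (int node : allNodes) {
        if (holders.count(node) == 0) candidates.push_back(node);
    }

    // Choosing uniformly at random spreads the data evenly over time.
    std::shuffle(candidates.begin(), candidates.end(), rng);
    if (candidates.size() > missing) candidates.resize(missing);
    return candidates;
}
```

A periodic task (once per day in the text) would call such a function for each file and start a copy to every node returned.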

Sector takes advantage of wide area, high performance net- 
works by using specialized network transport protocols such 
as UDT [13]. Sector also caches data connections. There- 
fore, frequent data transfers between the same pair of nodes 
do not need to set up a data connection every time. This 
reduces the connection setup overhead. 
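
The idea of caching data connections can be sketched with a small C++ class. The names `Connection` and `ConnectionCache` are hypothetical and only illustrate the technique; a real connection would wrap a transport socket, and the text does not describe how Sector expires or closes cached connections.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical connection object; a real one would wrap a transport socket.
struct Connection {
    std::string peer;
};

// Hypothetical cache keyed by peer address. A connection is set up on
// first use and reused for later transfers to the same peer.
class ConnectionCache {
public:
    std::shared_ptr<Connection> get(const std::string& peer) {
        auto it = cache_.find(peer);
        if (it != cache_.end()) return it->second;  // reuse, no setup cost
        auto conn = std::make_shared<Connection>(Connection{peer});
        ++setups_;                                  // count real setups
        cache_[peer] = conn;
        return conn;
    }

    // Number of connections that actually had to be established.
    int setups() const { return setups_; }

private:
    std::map<std::string, std::shared_ptr<Connection>> cache_;
    int setups_ = 0;
};
```

Two requests for the same peer return the same connection object, so only the first request pays the setup cost.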

Sector has separate layers for routing and transport and in- 
terfaces with these layers through well defined APIs. In this 
way, it is relatively straightforward to use other routing or 
network protocols. In addition, UDT is designed in such a 
way that a variety of different network protocols can be used 
simply by linking in one of several different libraries [13] . 

Sector's security mechanism is based on an Access Control
List (ACL). While reading data is open to the general public,
write access to the Sector system is controlled by the ACL:
the client's IP address must appear in the server's ACL in
order to upload data to that particular server. See Figure 3.
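
The access policy just described amounts to a very small check, sketched below. The enum `Access` and the function `isAllowed` are hypothetical names used for illustration; the sketch models the ACL as a set of IP address strings and ignores any additional read restrictions a community may impose.

```cpp
#include <set>
#include <string>

enum class Access { Read, Write };

// Hypothetical check mirroring the policy in the text: reads are open to
// everyone, while writes require the client's IP address to appear in
// the server's ACL.
bool isAllowed(const std::set<std::string>& acl,
               const std::string& clientIp,
               Access access) {
    if (access == Access::Read) return true;
    return acl.count(clientIp) > 0;
}
```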

Sector was designed to provide persistent storage services 
for data intensive applications that involve mining multi- 
terabyte datasets accessed over wide area 10 Gb/s networks. 

As an example, Sector is used to archive and to distribute the 
Sloan Digital Sky Survey (SDSS) to astronomers around the 
world. Using Sector, the SDSS BESTDR5 catalog, which is 
about 1.3TB when compressed, can be transported at ap- 
proximately 8.1 Gb/s over a 10 Gb/s wide area network with 
only 6 commodity servers [11] . 

Sector assumes that large datasets are divided into multi- 
ple files, say file01.dat, file02.dat, etc. It also assumes that 
each file is organized into records. In order to randomly ac- 
cess a record in the data set, each data file in Sector has a 
companion index file with the suffix ".idx". Continuing
the example above, there would be index files file01.dat.idx,
file02.dat.idx, etc. The data file and index file are always
co-located on the same node. Whenever Sector replicates 
the data file, the index file is also replicated. 

The index contains the start and end positions (i.e., the 
offset and size) of each record in the data file. For those 
data files without an index, Sphere can only process them 
at the file level, and the user must write a function that 
parses the file and extracts the data. 
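
To illustrate how an index of offsets and sizes gives random access to records, here is a minimal C++ sketch. The on-disk layout of the ".idx" files is not described in the text, so the struct `IndexEntry` and the function `readRecord` are hypothetical, and the data file is modeled as an in-memory string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of one index entry: where a record starts
// in the data file and how many bytes it occupies.
struct IndexEntry {
    std::size_t offset;
    std::size_t size;
};

// Return record i of `data` using the index, or an empty string if the
// entry does not exist or points outside the data.
std::string readRecord(const std::string& data,
                       const std::vector<IndexEntry>& index,
                       std::size_t i) {
    if (i >= index.size()) return "";
    const IndexEntry& e = index[i];
    if (e.offset > data.size() || e.size > data.size() - e.offset) return "";
    return data.substr(e.offset, e.size);
}
```

With the index, record i is reached directly from its offset and size, without parsing the records that precede it.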

A Sector client accesses data using Sector as follows: 



1. The Sector client connects to a known Sector server
S, and requests the locations of an entity managed by
Sector using the entity's name.

2. The Sector Server S runs a look-up inside the server
network using the services from the routing layer and
returns one or more locations to the client. In general,
an entity managed by Sector is replicated several times
within the Sector network. The routing layer can use
information involving network bandwidth and latency
to determine which replica location should be provided
to the client.

3. The client requests a data connection to one or more
servers on the returned locations using a specialized
Sector library designed to provide efficient message
passing between geographically distributed nodes. The
Sector library used for messaging uses a specialized
protocol developed for Sector called the Group Messaging
Protocol.

4. All further requests and responses are performed using
a specialized library for high performance network
transport called UDT [13]. UDT is used over the data
connection established by the message passing library.
Figure 4: Sector consists of several layered services.



support large distributed datasets, with loose management 
provided by geographically distributed clusters connected by 
a high performance wide area network. With this configu- 
ration, a peer-to-peer routing protocol (the Chord protocol 
described in [20]) is used so that nodes can be easily added



and removed from the system. 

The next version of Sector will support specialized routing 
protocols designed for wide area clouds with uniform band- 
width and approximately equal RTT between clusters, as 
well as non-uniform clouds in which bandwidth and RTT 
may vary widely between different clusters of the cloud. 

Data transport within Sector is done using specialized net- 
work protocols. In particular, data channels within Sector 
use high performance network transport protocols, such as 
UDT [13]. UDT is a rate-based application layer network
transport protocol that supports large data flows over wide 
area high performance networks. UDT is fair to several large 
data flows in the sense that it shares bandwidth equally be- 
tween them. UDT is also friendly to TCP flows in the sense 
that it backs off when congestion occurs, enabling any TCP 
flows sharing the network to use the bandwidth they require. 

Message passing with Sector is done using a specialized net- 
work transport protocol that we developed for this purpose 
called the Group Messaging Protocol or GMP. 

6. EXPERIMENTAL STUDIES 
6.1 Experimental Setup 

The wide area experiments use 6 servers: two are in Chicago, 
Illinois; two are in Greenbelt, Maryland; and two are in 
Pasadena, California. The wide area servers have double 
dual-core 2.4 GHz Opteron processors, 4GB RAM, 10GE 
MyriNet NIC, and 2TB of disk. 

The round trip time (RTT) between the servers in Green- 
belt and Chicago is 16ms. The RTT between Chicago and 
Pasadena is 55ms. The servers in Greenbelt and Pasadena 
are networked through Chicago and therefore the RTT is 71 
ms. All the servers are connected with 10 Gb/s networks. 

The local area experiments use 8 servers that have dual 4- 
core 2.4 GHz Xeon processors, 16GB RAM, 10GE MyriNet 
NIC, and 5.5TB of disk. Note that the servers for the local 
area experiments are newer than those used for the wide 
area experiments. 



5. DESIGN OF NETWORKING LAYER 

As mentioned above, Sector is designed to support a variety 
of different routing and networking protocols. The version 
used for the experiments described below is designed to



The version of Hadoop used for the experimental studies was
version 0.16.0. The Java(tm) version was 1.6.0; the Java(tm)
SE Runtime Environment was build 1.6.0-b105; the Java
HotSpot(tm) 64-Bit Server VM was build 1.6.0-b105, mixed
mode.



6.2 Hadoop vs Sphere - Geographically Distributed Locations

In this section, we describe tests using the Terasort benchmark
on six servers that are geographically distributed.
Two of the servers are in Chicago, Illinois; two are in Pasadena,
California; and two are in Greenbelt, Maryland. All the
servers are connected with a 10 Gb/s network.

Table 1 compares the performance of the Terasort benchmark
(sorting 10 GB of data on each node, with 100-byte records and
10-byte keys) using both Hadoop and Sphere.

To understand the performance of Sector/Sphere for typical 
data mining computations, we developed a benchmark that 
we call Terasplit. Terasplit takes data that has been sorted, 
for example by Terasort, and computes a single split for a 
tree based upon entropy [4]. Although Terasplit benchmarks
could be developed for multiple clients, the version we use for
the experiments here reads (possibly distributed) data into a
single client to compute the split. Table 1 also compares the
performance of Sector/Sphere and Hadoop for the Terasplit
benchmark.

6.3 Hadoop vs Sphere - Single Location 

In this section we describe some comparisons between Sphere
and Hadoop [3] on an 8-node Linux cluster in a single location.
As for the wide area experiments, we ran both the Terasort
and Terasplit benchmarks.

The file generation required 212 seconds per file per node for
Hadoop, which is a throughput of 440 Mb/s per node. For
Sphere, the file generation required 68 seconds per node,
which is a throughput of 1.1 Gb/s per node.

Both Hadoop and Sphere scale very well with respect to the
Terasort and Terasplit benchmarks, as Table 2 indicates.
Sphere is about 1.6-2.3 times faster than Hadoop as measured
by the Terasort benchmark and about 1.2-1.5 times
faster as measured by the Terasplit benchmark.
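
The speedups quoted here are ratios of running times. As a small worked illustration, the hypothetical helper `speedup` below computes such a ratio rounded to one decimal place; applied to the single-node Terasort times in Table 2 (645 seconds for Hadoop, 408 seconds for Sphere) it gives 1.6, and applied to the eight-node times (1000 and 443 seconds) it gives 2.3.

```cpp
#include <cmath>

// Speedup of Sphere over Hadoop: the ratio of the two running times,
// rounded to one decimal place as in the reported figures.
double speedup(double hadoopSeconds, double sphereSeconds) {
    return std::round(10.0 * hadoopSeconds / sphereSeconds) / 10.0;
}
```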

Although we expected Sector/Sphere to be faster for the 
wide area experiments, we did not expect to see such a dif- 
ference for the local area experiments. This may be due in 
part to our ability to tune Sphere more proficiently than we 
can tune Hadoop. Also, we noted that Hadoop performed 
better on clusters employing 1 Gb/s NICs than 10 Gb/s 
NICs. Sector/Sphere has been tested extensively using 10 
Gb/s NICs and Hadoop may not have been. 

6.4 Discussion 

As mentioned above, for the Terasort benchmark, Sector/Sphere 
only uses one of the 4 available cores, while Hadoop uses all 
4 cores. For this reason, the Terasort performance is not 
exactly comparable. 

Note that Sector/Sphere provides a performance improve- 
ment of approximately 2.4-2.6 over a wide area network 
compared to Hadoop as measured by the Terasort bench- 
mark, a performance improvement of 1.6-1.8 as measured 
by the Terasplit benchmark, and a performance improve- 
ment of 2.1-2.3 for the Terasort+Terasplit benchmark. 



For local area clusters, Sector/Sphere is about 1.6-2.3 times 
faster as measured by the Terasort benchmark and 1.2-1.5 
times faster as measured by the Terasplit benchmark. As 
mentioned above, this difference may be due to the fact that 
Hadoop has not been tuned to work with 10 Gb/s NICs. 

Note from the experimental studies reported in Table 1
that both Sector and Sphere scale to wide area networks. Specifically,
note that Sector/Sphere scales to four nodes in two distributed
locations over a network with an RTT of 16 ms with
a performance impact of approximately 41% (for Hadoop,
the impact is also approximately 41%). For three locations,
with RTTs of 16 ms, 55 ms and 71 ms between them, the performance
impact is approximately 82%, while for Hadoop the
impact is approximately 67%.

6.5 Availability and Repeatability 

Version 1.4 of Sector/Sphere was used for the experimental
studies described here. This version of Sector (as well as 
previous versions) is available from the Source Forge web 
site [19]. 

The Terasort benchmark is available from [3]. The Terasplit
benchmark will be available with the next release of Sector
[19]; in the interim, it can be downloaded from [6].

The Angle data set (used in the application below) is available
from the Large Data Archive [6].

With the Sector/Sphere software from Source Forge, the 
Terasort and Terasplit benchmarks, and the Angle datasets 
from the Large Data Archive, the experiments may all be 
repeated. The results may vary somewhat depending upon 
the specific servers used, the networks connecting them, and 
the other network traffic present. 

7. SPHERE APPLICATIONS 

We have built several applications with Sector and Sphere. 
In this section, we describe one of them. 

7.1 Angle 

Angle is a Sphere application that identifies anomalous or 
suspicious behavior in TCP packet data that is collected 
from multiple, geographically distributed sites. Angle con- 
tains Sensor Nodes that are attached to the commodity In- 
ternet and collect IP data. Connected to each Sensor Node 
on the commodity network is a Sector node on a wide area 
high performance network. The Sensor Nodes zero out the 
content, hash the source and destination IP to preserve pri- 
vacy, package moving windows of anonymized packets in 
pcap files [2] for further processing, and transfer these files 
to their associated Sector nodes. Sector services are used to
manage the data collected by Angle and Sphere services are 
used to identify anomalous or suspicious behavior. 
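
The anonymization performed by a Sensor Node can be sketched as follows. This is only an illustration: the structs `Packet` and `AnonymizedPacket` and the function `anonymize` are hypothetical, and `std::hash` is a placeholder because the text does not say which hash function Angle uses. A real deployment would likely want a keyed cryptographic hash instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical packet layout used only for this sketch.
struct Packet {
    std::string srcIp;
    std::string dstIp;
    std::vector<std::uint8_t> payload;
};

struct AnonymizedPacket {
    std::size_t srcHash;
    std::size_t dstHash;
    std::vector<std::uint8_t> payload;  // same length as the input, all zeros
};

// Zero out the content and replace both addresses by hashes, so that
// packets from the same address can still be grouped together.
AnonymizedPacket anonymize(const Packet& p) {
    std::hash<std::string> h;  // placeholder hash, see the note above
    return AnonymizedPacket{h(p.srcIp), h(p.dstIp),
                            std::vector<std::uint8_t>(p.payload.size(), 0)};
}
```

Because equal addresses map to equal hashes, later aggregation by source IP still works on the anonymized data.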

Angle Sensors are currently installed at four locations: the 
University of Illinois at Chicago, the University of Chicago, 
Argonne National Laboratory and the ISI/University of South- 
ern California. Each day, Angle processes approximately 575 
pcap files totaling approximately 7.6GB and 97 million pack- 
ets. To date, we have collected approximately 300,000 pcap 
files. 
Table 1: This table compares the performance of Sphere and Hadoop sorting a 10GB file on each of six
nodes that are distributed over a wide area network using the Terasort benchmark. The performance using 
the Terasplit benchmark is also reported, as is the total for Terasort plus Terasplit. The speedup of Sphere 
compared to Hadoop is reported for the Terasort and Terasplit benchmarks, as well as the total of the two. 
Nodes 1 and 2 are located in Chicago; nodes 3 and 4 are located in Pasadena; nodes 5 and 6 are located in 
Greenbelt. All measurements are in seconds. The nodes were double dual-core 2.4 GHz Opteron processors 
with 4 GB of memory. N.B. Different types of servers were used for the local and wide area tests. 



Nodes Used             1      1-2    1-3    1-4    1-5    1-6    1-7    1-8
Size of Dataset (GB)   10     20     30     40     50     60     70     80
Hadoop Terasort        645    766    768    773    815    882    901    1000
Sphere Terasort        408    409    410    429    430    436    440    443
Hadoop Terasplit       141    266    410    544    671    901    1133   1250
Sphere Terasplit       96     221    350    462    560    663    754    855
Total Hadoop           786    1032   1178   1317   1486   1784   2034   2250
Total Sphere           504    630    760    891    990    1099   1194   1298
Speedup Terasort       1.6    1.9    1.9    1.8    1.9    2.0    2.0    2.3
Speedup Terasplit      1.5    1.2    1.2    1.2    1.2    1.4    1.5    1.5
Speedup total          1.6    1.6    1.6    1.5    1.5    1.6    1.7    1.7



Table 2: This table compares the performance of Sphere and Hadoop sorting a 10GB file on each of eight 
nodes, all of which are located on a single rack. All measurements are in seconds. The nodes were dual quad 
core 2.4 GHz Xeon processors with 16 GB of memory. 



Briefly, Angle Sensor Nodes collect IP data, anonymize
the IP data, and produce pcap files that are then managed
by Sector. Sphere aggregates the pcap files by source
IP (or other specified entity) and computes files containing
features.

Sphere is also used for processing the feature files to identify
emergent behavior. This is done in several ways. One way is
for Sphere to aggregate feature files into temporal windows
$w_1, w_2, w_3, \ldots$, where each window has length $d$. For each
window $w_j$, clusters are computed with centers $a_{j,1}, a_{j,2},
\ldots, a_{j,k}$, and the temporal evolution of these clusters is used to
identify certain clusters called emergent clusters. For example,
if the clusters are relatively stable for windows $w_1, w_2,
\ldots, w_\alpha$, but there is a statistically significant change in the
clusters in $w_{\alpha+1}$, then one or more clusters from window
$w_{\alpha+1}$ can be identified. These clusters are called emergent
clusters.

The following simple statistic can be used to measure how far
the cluster centers move from window $w_j$ to window $w_{j+1}$:

$$\delta_j = \sum_{i=1}^{k} \left( \min_{m} \| a_{j,i} - a_{j+1,m} \|^2 \right).$$

Figure 5 shows this statistic for windows of length $d$ = 10
minutes. Notice that the statistic $\delta_j$ is quite choppy. On the
other hand, Figure 6 shows the same statistic for windows
of length $d$ = 1 day.
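
The statistic described above can be computed with a few lines of C++. This is a minimal sketch under stated assumptions: cluster centers are plain vectors of doubles of equal dimension, the minimum is taken over all centers of the following window, and the names `Point`, `squaredDistance` and `clusterShift` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// For each center of the current window, take the squared distance to the
// nearest center of the next window, and sum these over the current centers.
// If `next` is empty and `current` is not, the result is infinity.
double clusterShift(const std::vector<Point>& current,
                    const std::vector<Point>& next) {
    double total = 0.0;
    for (const Point& a : current) {
        double best = std::numeric_limits<double>::infinity();
        for (const Point& b : next) {
            best = std::min(best, squaredDistance(a, b));
        }
        total += best;
    }
    return total;
}
```

For example, centers (0, 0) and (10, 0) followed by centers (1, 0) and (10, 2) give a shift of 1 + 4 = 5, while identical sets of centers give 0.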




Figure 6: The graph above shows how the cluster 
centers move from one 1-day window to another as 
measured by the statistic $\delta_j$. Emergent clusters were
identified for the three days indicated and used as a 
basis for scoring functions. 



Number of records    Number of Sector Files    Time
500                  1                         1.9 s
1000                 3                         4.2 s
1,000,000            2850                      85 min
100,000,000          300,000                   178 hours



Table 3: The time spent clustering using Sphere 
scales as the number of files managed by Sector in- 
creases. 



Given one or more emergent clusters, a simple scoring function
can be used to identify feature vectors with emergent
behavior. For example, if $\theta_k$ are constants that sum to 1, $a_k$
is the center of an emergent cluster and $\sigma_k^2$ is its variance,
then the following score can be used to score feature vectors
$x$:

$$\rho(x) = \max_k \rho_k(x), \qquad \rho_k(x) = \theta_k \exp\left( -\frac{\| x - a_k \|^2}{2 \sigma_k^2} \right),$$

where the max is over emergent clusters $k$.

See [12] for more details.




Figure 5: The graph above shows how the cluster 
centers move from one ten minute window to another
as measured by the statistic $\delta_j$.

Table 3 shows the performance of Sector and Sphere when
computing cluster models as described above from distributed 
pcap files. In this table, the work load varies from 1 to 
300,000 distributed pcap files. This corresponds to approx- 
imately 500 to 100,000,000 feature vectors (each pcap file 
results in one file of features, which are then aggregated and 



clustered, but a feature file can contain various numbers of 
different feature vectors). 

8. SUMMARY AND CONCLUSION 

In this paper, we have described a cloud-based infrastruc- 
ture designed for data mining large distributed data sets over 
clusters connected with high performance wide area net- 
works. Sector/Sphere is open source and available through 
Source Forge. We have used it as a basis for several dis- 
tributed data mining applications. 

The infrastructure consists of the Sector storage cloud and 
the Sphere compute cloud. We have described the design of 
Sector and Sphere and showed through experimental stud- 
ies that Sector/Sphere can process large datasets that are 
distributed across the continental U.S. with a performance 
penalty of approximately 80% compared to the time required 
if all the data were located on a single rack. Sector/Sphere 
utilize a specialized networking layer to achieve this perfor- 
mance. 

We have also described a Sector/Sphere application to detect 
emergent behavior in network traffic and showed that for 
this application Sector/Sphere can compute clusters on over 
300,000 distributed files. 

Finally, we performed experimental studies on a wide area
testbed and demonstrated that Sector/Sphere is approximately
2.4-2.6 times faster than Hadoop [3] using the Terasort
benchmark supplied with Hadoop. Using a benchmark
we developed called Terasplit that computes a single split in a
classification and regression tree, we found that Sector/Sphere
was about 1.6-1.9 times faster than Hadoop.